BIG6 Associates: 4-Step Approach

In [1]:
#For Step 1
import folium                     # Import folium for interactive leaflet maps
import geopandas as gpd           # Import geopandas for geospatial data handling
In [2]:
#For the rest of the phases
import pandas as pd               # Import pandas package
import numpy as np                # Import numpy package
import matplotlib.pyplot as plt   # Import plotting functions
import seaborn as sns             # Import statistical visualisation functions
import plotly.graph_objects as go # Import interactive plotting functions
import warnings                   # Import warnings control
import statsmodels.formula.api as smf # Import stats modelling package for OLS regression
from sklearn.model_selection import train_test_split  # Import train/test split for model evaluation
from sklearn.preprocessing import StandardScaler  # Import Standard Scaler to standardise data

Step 1: Determining the optimal US city for Python Inc. to set up its car dealership

1.1 Determining the US state to focus on

Since Python Inc. is a luxury car dealer, we will identify its target group of high-income consumers who are more likely to purchase luxury cars. Hence, we select the US state with the highest median household income as it is a proxy for the average purchasing power of residents living in a particular area.

We utilised a combination of three datasets:

  1. Longitude and Latitude of US States
  2. Median Household Income of US States
  3. US State Codes

We begin by cleaning the data and merging the three datasets together.
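The state names in the lat/long file carry a country suffix (e.g. 'Texas, the USA'). As a minimal sketch, on a hypothetical three-row sample mirroring that format, the suffix can be stripped in one vectorised step with the pandas `.str` accessor:

```python
import pandas as pd

# Hypothetical mini-frame in the same format as 'States Lat Long.csv'
stateslatlong = pd.DataFrame({
    'State': ['Wisconsin, the USA', 'West Virginia, the US', 'Texas, the USA'],
    'Latitude': [44.5, 39.0, 31.0],
    'Longitude': [-89.5, -80.5, -100.0],
})

# Vectorised equivalent of looping with split(','): keep the text before the comma
stateslatlong['State'] = stateslatlong['State'].str.split(',').str[0]
print(stateslatlong['State'].tolist())  # ['Wisconsin', 'West Virginia', 'Texas']
```

This replaces the explicit Python loop with a single pandas operation, which scales better and avoids rebuilding the column by hand.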

In [3]:
df = pd.read_csv('Income Dataset.csv')
stateslatlong = pd.read_csv('States Lat Long.csv')
stateslatlong.head()
Out[3]:
Place Name Latitude Longitude
0 Wisconsin, the USA 44.5 -89.500000
1 West Virginia, the US 39.0 -80.500000
2 Vermont, the USA 44.0 -72.699997
3 Texas, the USA 31.0 -100.000000
4 South Dakota, the US 44.5 -100.000000
In [4]:
df.head() 
Out[4]:
State Median Household Income
0 Alabama $55,935.00
1 Alaska $78,038.00
2 Arizona $70,080.00
3 Arkansas $54,322.00
4 California $78,002.00
In [5]:
# Clean states lat long data to merge with household income
stateslatlong.columns = ['State', 'Latitude','Longitude']
namelist = stateslatlong['State'].tolist()
namelistnew = []
for i in namelist:
    namelistnew.append(i.split(',')[0])
stateslatlong['State'] = namelistnew
In [6]:
#Rename Washington State to Washington for consistency across 2 datasets
stateslatlong.loc[(stateslatlong['State'] == 'Washington State'),'State'] = 'Washington'
In [7]:
#merge data sets
df_combined = pd.merge(left=df, right=stateslatlong, how = 'left' ,left_on='State', right_on='State')
In [8]:
# District of Columbia has no match in the lat/long data; as a federal district rather than a state, we drop it
docindex = df_combined[(df_combined.State == 'District of Columbia')].index
docindex

df_combined = df_combined.drop(docindex).reset_index(drop = True)
df_combined.head()
Out[8]:
State Median Household Income Latitude Longitude
0 Alabama $55,935.00 32.318230 -86.902298
1 Alaska $78,038.00 66.160507 -153.369141
2 Arizona $70,080.00 34.048927 -111.093735
3 Arkansas $54,322.00 34.799999 -92.199997
4 California $78,002.00 36.778259 -119.417931
In [9]:
# Change the values of Median Household Income from str to float 
emplist = []
for x in df_combined["Median Household Income"]:
    y = x.replace("$","")
    z = y.replace(",","")
    emplist.append(float(z))
df_combined["Median Household Income"] = pd.Series(emplist)
df_combined.head()
Out[9]:
State Median Household Income Latitude Longitude
0 Alabama 55935.0 32.318230 -86.902298
1 Alaska 78038.0 66.160507 -153.369141
2 Arizona 70080.0 34.048927 -111.093735
3 Arkansas 54322.0 34.799999 -92.199997
4 California 78002.0 36.778259 -119.417931
In [10]:
# Insert the state codes into the dataset 
codes = pd.read_csv('states.csv')
codes.head()
Out[10]:
State Abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
In [11]:
#Remove District of Columbia from this code dataset 
docindex = codes[(codes.State == 'District of Columbia')].index
docindex

codes = codes.drop(docindex).reset_index(drop = True)
In [12]:
#Merging 3 datasets
df_combined.insert(0, "Abbreviation", codes["Abbreviation"], allow_duplicates=False)
df_combined.head()
Out[12]:
Abbreviation State Median Household Income Latitude Longitude
0 AL Alabama 55935.0 32.318230 -86.902298
1 AK Alaska 78038.0 66.160507 -153.369141
2 AZ Arizona 70080.0 34.048927 -111.093735
3 AR Arkansas 54322.0 34.799999 -92.199997
4 CA California 78002.0 36.778259 -119.417931

Plotting an interactive map to show median household income across US states.

In [13]:
poly = gpd.read_file('us-states.json')
m = folium.Map(location=[44.788689, -95.501507], zoom_start=3.5)

for i in range(0, len(df_combined)):
    folium.Marker(df_combined.iloc[i,3:5].tolist(), popup='Median Household Income in ' + '<b>' +\
              df_combined.iloc[i,1] + '</b>'+ '<br>' + '<br>'+ '<b>' + str(df_combined.iloc[i,2]) +'</b>').add_to(m)

for i in range(0, len(poly)):
    folium.GeoJson(poly['geometry'][i]).add_to(m)

m
Out[13]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Plotting a choropleth map of median household income across US states.

In [14]:
#plotting choropleth graph
x = df_combined

fig = go.Figure(data=go.Choropleth(
    locations=x['Abbreviation'], # Spatial coordinates
    z = x['Median Household Income'].astype(float), # Data to be color-coded
    locationmode = 'USA-states', # set of locations match entries in `locations`
    colorscale = 'Reds',
    colorbar_title = "USD",
))

fig.update_layout(
    title_text = 'Median Household Income of USA States',
    geo_scope='usa', # limit map scope to USA
)

fig.show()

Plotting a lollipop chart to visualise median household income across US states.

In [15]:
# plotting median household income across US states
df = pd.DataFrame({'group':df_combined["State"], 'values':df_combined["Median Household Income"] })
 
# Reorder it following the values:
ordered_df = df.sort_values(by='values')
my_range=range(1,len(df.index)+1)
 
# The horizontal lines are drawn using the hlines function
sns.set(style='white', font='Arial',font_scale=1.3)
plt.figure(figsize=(18,20))
color = ['skyblue' if (x < max(ordered_df['values'])) 
         else 'red' for x in ordered_df['values']]
plt.hlines(y=my_range, xmin=0, xmax=ordered_df['values'], color=color)
plt.plot(ordered_df['values'], my_range, "o", color="darkblue")
 
# Add titles and axis names
plt.yticks(my_range, ordered_df['group'], size=12)
plt.title("Median Household Income of US States", loc='center',size=24, fontweight="bold")
plt.xlabel('Median Household Income',size=20)
plt.ylabel('US States',size=20)
    
plt.show()

From the graphs above, we identify that the US State with the highest median household income is Maryland.

1.2 Identifying a suitable city in Maryland to set up operations

In Maryland, 31.8 out of 100 residents own a registered car. Assuming that median household income and vehicle ownership per capita do not fluctuate much across cities in Maryland, we select the city with the highest population, as this translates into the largest number of vehicle owners in the state. We also consider the population density of each city to estimate the number of vehicle owners per unit area of land.

We use a separate dataset indicating the population size and population density of each city in Maryland.

First, we clean the dataset.
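The population figures arrive as strings with thousands separators (e.g. '593,490'). A minimal sketch of the conversion, on a hypothetical two-row sample in the same format, using `str.replace` plus `pd.to_numeric` in place of an explicit loop:

```python
import pandas as pd

# Hypothetical sample in the same format as 'Maryland Cities Population.csv'
cities = pd.DataFrame({'City': ['Baltimore', 'Columbia'],
                       'Population': ['593,490', '103,663']})

# Strip the thousands separator, then let pandas parse the result numerically
cities['Population'] = pd.to_numeric(cities['Population'].str.replace(',', '', regex=False))
print(cities['Population'].tolist())
```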

In [16]:
cities = pd.read_csv('Maryland Cities Population.csv')
cities.head()
Out[16]:
City Population
0 Baltimore 593,490
1 Columbia 103,663
2 Germantown 90,844
3 Silver Spring 79,750
4 Waldorf 74,587
In [17]:
#change the values of population from str to float
emplist1 = []
for x in cities["Population"]:
    z = x.replace("$","")
    zz = z.replace(",","")
    emplist1.append(float(zz))
cities["Population"] = pd.Series(emplist1)
cities.head()
Out[17]:
City Population
0 Baltimore 593490.0
1 Columbia 103663.0
2 Germantown 90844.0
3 Silver Spring 79750.0
4 Waldorf 74587.0

To calculate the population density, we used data from the Census Bureau to determine the land area for each of the top 10 cities.

In [18]:
pop_d = pd.read_csv('pop_density.csv')
pop_d
Out[18]:
City Land Area in Km^2
0 Baltimore 239.00
1 Columbia 83.40
2 Germantown 28.00
3 Silver Spring 20.50
4 Waldorf 94.53
5 Ellicott City 77.90
6 Frederick 57.50
7 Glen Burnie 46.70
8 Rockville 35.33
9 Gaithersburg 27.04
In [19]:
#ensure the values of land area are floats
emplist2 = []
for x in pop_d["Land Area in Km^2"]:
    zz = float(x)
    emplist2.append(zz)
pop_d["Land Area in Km^2"] = pd.Series(emplist2)
pop_d.head()
Out[19]:
City Land Area in Km^2
0 Baltimore 239.00
1 Columbia 83.40
2 Germantown 28.00
3 Silver Spring 20.50
4 Waldorf 94.53

To calculate population density, we divided population size by land area.
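One caveat: assigning the land-area column directly onto the population frame relies on both tables listing the cities in the same row order. A sketch of an order-independent alternative, merging on the city name, shown on hypothetical two-row tables deliberately given different row orders:

```python
import pandas as pd

# Hypothetical two-row versions of the city and land-area tables
cities = pd.DataFrame({'City': ['Baltimore', 'Columbia'],
                       'Population': [593490.0, 103663.0]})
pop_d = pd.DataFrame({'City': ['Columbia', 'Baltimore'],   # note: different order
                      'Land Area in Km^2': [83.40, 239.00]})

# Merging on 'City' aligns rows by name, so row order no longer matters
top10 = cities.merge(pop_d, on='City', how='left')
top10['Population Density (km^2)'] = top10['Population'] / top10['Land Area in Km^2']
print(round(top10.loc[0, 'Population Density (km^2)'], 2))  # 2483.22 for Baltimore
```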

In [20]:
top10bypopulation = cities.copy()
top10bypopulation = top10bypopulation.head(10)
top10bypopulation['Land Area in km^2'] = pop_d["Land Area in Km^2"]
top10bypopulation['Population Density (km^2)'] = top10bypopulation['Population']/top10bypopulation['Land Area in km^2']
top10bypopulation
Out[20]:
City Population Land Area in km^2 Population Density (km^2)
0 Baltimore 593490.0 239.00 2483.221757
1 Columbia 103663.0 83.40 1242.961631
2 Germantown 90844.0 28.00 3244.428571
3 Silver Spring 79750.0 20.50 3890.243902
4 Waldorf 74587.0 94.53 789.029938
5 Ellicott City 72247.0 77.90 927.432606
6 Frederick 72244.0 57.50 1256.417391
7 Glen Burnie 69813.0 46.70 1494.925054
8 Rockville 68079.0 35.33 1926.945938
9 Gaithersburg 67985.0 27.04 2514.238166

Bar and line plot to visualise the population size and population density of cities in Maryland.

In [21]:
#Plotting
sns.set(style='white', font='Arial',font_scale=1.3)
fig,ax = plt.subplots(figsize=(20,5))
d = top10bypopulation['Population']
color = ['grey' if (x < max(d)) else '#E83D3D' for x in d ]
ax.bar(top10bypopulation['City'], top10bypopulation['Population'], width = 0.8, color = color,alpha=0.8)
ax2 = ax.twinx()
sns.lineplot(x = top10bypopulation['City'] , y= top10bypopulation['Population Density (km^2)'], ax = ax2, color = 'blue')

#Labelling and aesthetics
ax.set_xlabel('Cities in Maryland',fontsize=17)
ax.set_ylabel('Population',fontsize=17)
ax2.set_ylabel('Population Density (km^2)',fontsize=15)
plt.title('Population of Cities in Maryland',fontsize=20,fontweight="bold")
plt.xticks(fontsize=13, rotation=90)
plt.show()

Insight: As seen from the graph above, Baltimore is the city with the largest population and a relatively high population density.

As luxury cars serve a niche market, consumers are more likely to travel a longer distance to purchase from a luxury car dealership. This means that our client will still be able to capture consumers located farther from its dealership, which diminishes the impact of Baltimore's somewhat lower population density. Furthermore, given that Baltimore has the largest land area, its residents have a greater need for a car to move around the city. We therefore weight the larger population size more heavily than the somewhat lower population density, as it signifies a more substantial luxury car consumer base.

Since Maryland is the state with the highest median household income in the US, and Baltimore likely has the largest number of vehicle owners of any city in the state, we infer that there is a sizable market for luxury cars in Baltimore, Maryland. Hence, we recommend that Python Inc. set up operations there.

Step 2: Choosing the top luxury car brands for Python Inc. to sell

2.1 Initial Data Preparation

For the rest of this report, we will be analysing US car sales, and we begin by preparing our sample data for the relevant analysis.

We utilised one key dataset:

  1. US Car Sales

Reading data

In [22]:
data = pd.read_csv('Car_Sales (US).csv')
print('NaN values in each column:')
print(data.isna().sum())                                       # Data before cleaning
NaN values in each column:
MSRP                    0
Year                    0
Engine_Fuel_Type        3
Engine_HP              69
Engine_Cylinders       30
Transmission_Type       0
Driven_Wheels           0
Number_of_Doors         6
Vehicle_Size            0
Vehicle_Style           0
Highway_MPG             0
City_MPG                0
Popularity              0
Make                    0
Model                   0
Market_Category      3742
dtype: int64

Clearing NaN and invalid values from our car dataset

In [23]:
data = pd.read_csv('Car_Sales (US).csv')

def clean_dataset(df):
    df.dropna(inplace=True)                                      # Remove NaN values
    df = df.loc[df['Transmission_Type']!='UNKNOWN']              # Remove "UNKNOWN" values for "Transmission_Types"
    df = df.loc[df['Engine_Cylinders']!=0]                       # Remove "0" values for "Engine_Cylinders"
    return df

data_cleaned = clean_dataset(data)

print('NaN values in each column:')
print(data_cleaned.isna().sum())
NaN values in each column:
MSRP                 0
Year                 0
Engine_Fuel_Type     0
Engine_HP            0
Engine_Cylinders     0
Transmission_Type    0
Driven_Wheels        0
Number_of_Doors      0
Vehicle_Size         0
Vehicle_Style        0
Highway_MPG          0
City_MPG             0
Popularity           0
Make                 0
Model                0
Market_Category      0
dtype: int64
In [24]:
# To find out the total number of cars left in our sample
total_rows = len(data_cleaned.axes[0]) 
print('No of rows: '+ str(total_rows))
No of rows: 8068

Deriving sample sales volume and average MSRP for each brand
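The brand-level metrics can also be computed in a single `groupby`/`agg` call instead of a per-brand loop. A minimal sketch on a hypothetical three-row sample of the cleaned data:

```python
import pandas as pd

# Hypothetical slice of the cleaned sales data
data_cleaned = pd.DataFrame({
    'Make': ['BMW', 'BMW', 'Audi'],
    'MSRP': [60000, 64000, 53000],
    'Popularity': [3916, 3916, 3105],
})

# One groupby/agg computes mean MSRP, row count, and mean popularity per brand
brand_stats = (data_cleaned.groupby('Make')
               .agg(**{'Average MSRP': ('MSRP', 'mean'),
                       'Quantity Sold': ('MSRP', 'size'),
                       'Popularity': ('Popularity', 'mean')})
               .reset_index())
brand_stats['Quantity Sold (% of total)'] = (
    brand_stats['Quantity Sold'] / len(data_cleaned) * 100)
print(brand_stats)
```

This avoids repeatedly re-filtering the full frame for each brand, which the loop-based approach does.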

In [25]:
# Creating No. of transactions for each brand
car_sales = pd.DataFrame(data_cleaned['Make'].value_counts())
car_sales.columns =['Number of cars'] # Renaming the column index
car_sales.head()
Out[25]:
Number of cars
Chevrolet 608
Volkswagen 581
Ford 492
Cadillac 397
Mercedes-Benz 349
In [26]:
car_brands = list(data_cleaned['Make'].unique())        # Creating a list of each of the car_brands
avg_brand_MSRP = []                                     # Empty list for appending later
sales_quantity = []                                     # Empty list for appending later
sales_quantity_perc = []                                # Empty list for appending later
car_popularity = []


for item in car_brands:
       
    """This section is for finding the mean MSRP for each car brand"""
    brand_selected = data_cleaned['Make'] == item                # Isolate the rows to the specific car brand
    brand_mean = data_cleaned.loc[brand_selected]['MSRP'].mean() # Finding the mean MSRP for the car brand
    avg_brand_MSRP.append(brand_mean)                            # Appending mean MSRP for the Brand to the list 
    
    """This section is for finding the number of cars per brand in this sample"""
    sales_quantity.append(car_sales.loc[item]['Number of cars']) # The quantity of each type of car sold
    sales_quantity_perc.append((car_sales.loc[item]['Number of cars'] / total_rows) * 100 ) # Quantity as a percentage
    
    """Car Popularity"""
    car_popularity.append(data_cleaned.loc[brand_selected]['Popularity'].mean())

Brand_MSRP = pd.DataFrame()                             # Creation of empty data frame

Brand_MSRP['Brand'] = car_brands 
Brand_MSRP['Average MSRP'] = avg_brand_MSRP
Brand_MSRP['Quantity Sold'] = sales_quantity
Brand_MSRP['Quantity Sold (% of total)'] = sales_quantity_perc
Brand_MSRP['Popularity'] = car_popularity
Brand_MSRP.head()
Out[26]:
Brand Average MSRP Quantity Sold Quantity Sold (% of total) Popularity
0 BMW 61775.209091 330 4.090233 3916.0
1 Audi 53452.112805 328 4.065444 3105.0
2 FIAT 22370.657895 38 0.470997 819.0
3 Mercedes-Benz 71800.885387 349 4.325731 617.0
4 Chrysler 29978.870370 108 1.338622 1013.0

With the creation of data frames for average MSRP, popularity, and quantity sold, we now have the relevant metrics to determine which luxury brands to recommend for Python Inc.

2.2 Grouping of all brands into 3 main categories

For our dataset, we use price to segment all car brands into the 3 tiers of cars that US consumers buy, namely:

  1. Budget Cars
  2. Mid-tier
  3. Luxury
In [27]:
#With average MSRP of all brands, we sort into 3 bands
sorted_msrp = Brand_MSRP.sort_values(by= ['Average MSRP'],inplace=False)

Data visualisation of Average MSRP per brand

In [28]:
#preparing data
xaxislabels = sorted_msrp['Brand']

#plotting
sns.set(style='white', font='Arial',font_scale=1)
plt.figure(figsize=(18,4))
plt.bar(xaxislabels,sorted_msrp['Average MSRP'], color= '#db991a', alpha = 0.6,edgecolor='black')
plt.scatter(xaxislabels,sorted_msrp['Average MSRP'], marker='x', c= '#633193', alpha = 1)

#Labelling and aesthetics
plt.xlabel('Car Brand', fontsize = 15)
plt.xticks(rotation = 90)
plt.title('Car Brands and their average MSRP',fontsize=18, fontweight="bold")
plt.ylabel('Average MSRP' , fontsize = 15)
plt.show()

Based on the distribution above, we group the brands into 3 main categories using 2 price cut-offs:

  1. Budget: p ≤ 40,000
  2. Mid-tier: 40,000 < p < 100,000
  3. Luxury: p ≥ 100,000

We keep in mind that Bugatti is effectively ultra-luxury, as its average price is far above even the other luxury brands.
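The same banding can be expressed with `pd.cut`; a minimal sketch on three hypothetical average-MSRP values, one per band (behaviour at the exact cut points may differ slightly from explicit boolean masks):

```python
import pandas as pd

# Hypothetical average-MSRP values, one in each band
msrp = pd.Series([22000, 65000, 250000])

# pd.cut assigns each price to a band using the 40k / 100k cut points
category = pd.cut(msrp, bins=[0, 40000, 100000, float('inf')],
                  labels=['budget', 'mid-tier', 'luxury'])
print(category.tolist())  # ['budget', 'mid-tier', 'luxury']
```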

In [29]:
#Adding the categories into the sorted_msrp dataframe
sorted_msrp["Car Category"] = "budget"
is_middle = (sorted_msrp['Average MSRP'] > 40000) & (sorted_msrp['Average MSRP'] <100000)
is_luxury = sorted_msrp['Average MSRP'] >= 100000

sorted_msrp.loc[is_middle,"Car Category"] = 'mid-tier'
sorted_msrp.loc[is_luxury,"Car Category"] = 'luxury'
sorted_msrp.head()
Out[29]:
Brand Average MSRP Quantity Sold Quantity Sold (% of total) Popularity Car Category
33 Plymouth 4076.820513 39 0.483391 535.0 budget
8 Mitsubishi 20221.880000 125 1.549331 436.0 budget
42 Scion 20395.937500 48 0.594943 105.0 budget
44 Suzuki 21153.050505 99 1.227070 481.0 budget
2 FIAT 22370.657895 38 0.470997 819.0 budget

Visualisation of Car tiers

In [30]:
#preparing data
groups = sorted_msrp.groupby("Car Category",sort=False)
colors = ['#62411f','#f68e43','#221dbc']  #our desired colors

#plotting
sns.set(style='white', font='Arial',font_scale=1)
plt.figure(figsize=(18,4))
i = 0
for name, group in groups:
    plt.bar(group["Brand"], group["Average MSRP"],color = colors[i],label=name, edgecolor='black')
    i = i + 1
    
#Labelling and aesthetics
plt.xlabel('Car Brand', fontsize = 15)
plt.xticks(fontsize= 12, rotation = 90)
plt.title('Car Brands in each category',fontsize=18, fontweight='bold')
plt.ylabel('Average MSRP' , fontsize = 15)
plt.legend()
plt.show()

Slicing to obtain luxury brands

In [31]:
# We filter out the cars to only luxury brands, and then select top 3 brands to analyse
luxury_only = sorted_msrp.loc[sorted_msrp['Car Category'] == 'luxury']

#sorting brands by popularity
pop_sorted = luxury_only.sort_values(by= ['Popularity'],inplace=False)
pop_sorted.head()
Out[31]:
Brand Average MSRP Quantity Sold Quantity Sold (% of total) Popularity Car Category
30 Spyker 213323.333333 3 0.037184 2.0 luxury
12 Maybach 546221.875000 16 0.198314 67.0 luxury
34 Rolls-Royce 351130.645161 31 0.384234 86.0 luxury
35 Maserati 114207.706897 58 0.718889 238.0 luxury
37 Aston Martin 197910.376344 93 1.152702 259.0 luxury

2.3 Brand vs Popularity

Visualisation of brands with the highest popularity

In [32]:
#preparing data
top5luxpop = pop_sorted['Popularity'] > 500
xaxislabels = pop_sorted['Brand']
pop_adjusted = pop_sorted['Popularity'].loc[top5luxpop]
xaxislabels_adjusted = xaxislabels.loc[top5luxpop]

#plotting
sns.set(style='white', font='Arial',font_scale=1)
plt.figure(figsize=(15,4))
plt.plot(xaxislabels, pop_sorted['Popularity'] , c= 'b', alpha = 0.7, 
         label = 'Popularity', linewidth = 1.25, linestyle = '--') 
plt.scatter(xaxislabels_adjusted, pop_adjusted, c= 'r', alpha = 1, marker = "*", label ="Top 5 Brands", s = 100)
plt.fill_between(xaxislabels_adjusted, pop_adjusted, alpha = 0.15 , color = 'g')

#Labelling and aesthetics
plt.xlabel('Brand', fontsize = 14)
plt.xticks(rotation = 0)
plt.ylabel('Popularity' , fontsize = 14)
plt.title('Luxury car brands sorted by popularity',fontsize=18, fontweight='bold')
plt.legend()
plt.show()

From the above, we narrow our selection to the 5 most popular luxury brands: Bentley, Bugatti, Lamborghini, Porsche and Ferrari.

2.4 Brand vs Quantity

Visualisation of brands with the highest sales volume

In [33]:
#Sorting dataframe by quantity
qty_sorted = luxury_only.sort_values(by= ['Quantity Sold (% of total)'],inplace=False)
qty_sorted.head()
Out[33]:
Brand Average MSRP Quantity Sold Quantity Sold (% of total) Popularity Car Category
30 Spyker 2.133233e+05 3 0.037184 2.0 luxury
46 Bugatti 1.757224e+06 3 0.037184 820.0 luxury
11 McLaren 2.398050e+05 5 0.061973 416.0 luxury
12 Maybach 5.462219e+05 16 0.198314 67.0 luxury
34 Rolls-Royce 3.511306e+05 31 0.384234 86.0 luxury
In [34]:
#preparing data
xaxislabels = qty_sorted['Brand']

#plotting
sns.set(style='white', font='Arial',font_scale=1)
plt.figure(figsize=(15,4))
plt.bar(xaxislabels, qty_sorted['Quantity Sold (% of total)'], color= 'r', alpha = 0.5, edgecolor='black')

#Labelling and aesthetics
plt.xlabel('Luxury Car Brand', fontsize = 12)
plt.ylabel('Quantity Sold (% of total)' , fontsize = 12)
plt.title('Luxury car brands sorted by quantity',fontsize=16, fontweight= 'bold')
plt.show()
In [35]:
#Preparing data
data_pie2 = qty_sorted['Quantity Sold']
brands_pie2 = qty_sorted['Brand']

#Plotting
fig = go.Figure(data=[go.Pie(labels=brands_pie2, values=data_pie2, textinfo='label+percent',
                             insidetextorientation='radial')])

#Labelling and aesthetics
fig.update_traces(hole=.3, hoverinfo="label+percent+name")
fig.update_layout(title_text="Proportion of Luxury cars Sold for each Brand")
fig.show()

Next, by comparing quantity sold across all brands, we determine the top 5 brands by sales volume to be: Maserati, Ferrari, Bentley, Aston Martin and Porsche.

2.5 Finding the top 3 luxury brands

By corroborating our findings for the top 5 brands in both popularity and quantity sold, we can determine the top 3 car brands in the luxury car market.
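The corroboration amounts to a set intersection of the two top-5 lists. A minimal sketch (list order here follows alphabetical sorting rather than the notebook's frame order):

```python
# Top-5 lists from the popularity and quantity rankings
top_5_pop = ['Bentley', 'Bugatti', 'Lamborghini', 'Porsche', 'Ferrari']
top_5_qty = ['Maserati', 'Ferrari', 'Bentley', 'Aston Martin', 'Porsche']

# The overlap of the two rankings gives the shortlist directly
top_3_brands = sorted(set(top_5_pop) & set(top_5_qty))
print(top_3_brands)  # ['Bentley', 'Ferrari', 'Porsche']
```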

In [36]:
#Top 5 Cars by Popularity
top_5_pop = pop_sorted[len(pop_sorted)-5:len(pop_sorted)]
top_5_pop = list(top_5_pop['Brand'])
print('The most popular luxury brands are: '+ str(top_5_pop))
The most popular luxury brands are: ['Bentley', 'Bugatti', 'Lamborghini', 'Porsche', 'Ferrari']
In [37]:
#Top 5 Cars by Quantity
top_5_qty = qty_sorted[len(qty_sorted)-5:len(qty_sorted)]
top_5_qty = list(top_5_qty['Brand'])
print('The most sold luxury brands are: '+ str(top_5_qty))
The most sold luxury brands are: ['Maserati', 'Ferrari', 'Bentley', 'Aston Martin', 'Porsche']
In [38]:
#Finding intersection of the 2 data frames
top_3_brands = []

for i in luxury_only['Brand']:
    if i in top_5_qty and i in top_5_pop:
        top_3_brands.append(i)
    else:
        continue
top_3_brands
Out[38]:
['Porsche', 'Ferrari', 'Bentley']
In [39]:
#Preparing data
brands = pop_sorted['Brand']
x = pop_sorted['Quantity Sold']
y = pop_sorted['Popularity']
z = pop_sorted['Average MSRP']
z= z/200

#Plotting
sns.set(style='white', font='Arial',font_scale=1.3)
plt.figure(figsize=(15,6))
# Change color with c and alpha. I map the color to the X axis value.
plt.scatter(x, y, s=z, c=y, cmap="summer_r", alpha=0.4, edgecolors="grey", linewidth=2)

# zip joins x,y coordinates in pairs with brand names
for i,j,k in zip(x,y,brands):
    plt.annotate(k, # this is the text
                 (i,j), # this is the point to label
                 textcoords="offset points", # how to position the text
                 xytext=(0,0), # distance from text to points (x,y)
                 ha='center',fontsize= 9,fontweight= 'bold') # horizontal alignment can be left, right or center

plt.ylim(-600, 3050)
plt.xlim(-20,150)
plt.xlabel("Quantity Sold", fontsize= 12)
plt.ylabel("Popularity", fontsize= 12)
plt.title('Popularity of Brand vs. Quantity Sold' , fontsize = 18, fontweight= 'bold')
plt.show()

Insight: From our analysis, we determine Porsche, Ferrari and Bentley to be the top 3 brands to sell.

Step 3: Variable Selection

We determine which attributes of luxury cars correlate with higher MSRP, so that the chosen brands can command higher prices and hence greater profits.

Filtering cleaned dataset to get luxury car data

In [40]:
#Dataframe for luxury brands and their attributes
luxury_brands = ['Porsche', 'Maserati', 'Aston Martin', 'Spyker', 
                 'Ferrari', 'McLaren', 'Bentley', 'Lamborghini', 'Rolls-Royce', 'Maybach', 'Bugatti']
    
is_luxury = data_cleaned['Make'].isin(luxury_brands)
is_luxury
luxury_data = data_cleaned.loc[is_luxury]
luxury_data.head()
Out[40]:
MSRP Year Engine_Fuel_Type Engine_HP Engine_Cylinders Transmission_Type Driven_Wheels Number_of_Doors Vehicle_Size Vehicle_Style Highway_MPG City_MPG Popularity Make Model Market_Category
294 160829 2002 premium unleaded (required) 400.0 8.0 MANUAL rear wheel drive 2.0 Compact Convertible 15 10 2774 Ferrari 360 Exotic,High-Performance
295 140615 2002 premium unleaded (required) 400.0 8.0 MANUAL rear wheel drive 2.0 Compact Coupe 15 10 2774 Ferrari 360 Exotic,High-Performance
296 150694 2002 premium unleaded (required) 400.0 8.0 AUTOMATED_MANUAL rear wheel drive 2.0 Compact Coupe 15 10 2774 Ferrari 360 Exotic,High-Performance
297 170829 2002 premium unleaded (required) 400.0 8.0 AUTOMATED_MANUAL rear wheel drive 2.0 Compact Convertible 15 10 2774 Ferrari 360 Exotic,High-Performance
298 165986 2003 premium unleaded (required) 400.0 8.0 MANUAL rear wheel drive 2.0 Compact Convertible 15 10 2774 Ferrari 360 Exotic,High-Performance

3.1 Correlation Matrix for luxury car brands

In [41]:
#Preparing data
luxury_corr = luxury_data.corr()

#Plotting
sns.set(style='white', font='Arial',font_scale=1.3)
f,ax=plt.subplots(figsize=(11,6))
sns.heatmap(luxury_corr, cmap='rocket_r', annot=True, linewidths=.3)

#Labelling and aesthetics
plt.title("Correlation between features (Luxury Cars)", 
          weight='bold', 
          fontsize=18)
plt.show()

To find the variables that affect price the most, we considered those with a Pearson coefficient larger than 0.500 in absolute terms. As seen above, 'Engine_HP' and 'Engine_Cylinders' correlate the most with MSRP.
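This threshold rule can be applied programmatically to the MSRP column of the correlation matrix. A minimal sketch, using hypothetical correlation values for illustration:

```python
import pandas as pd

# Hypothetical correlations of each numeric feature with MSRP
corr_with_msrp = pd.Series({'Engine_HP': 0.66, 'Engine_Cylinders': 0.57,
                            'Highway_MPG': -0.35, 'Popularity': 0.02})

# Keep features whose Pearson coefficient exceeds 0.5 in absolute terms
strong = corr_with_msrp[corr_with_msrp.abs() > 0.5].index.tolist()
print(strong)  # ['Engine_HP', 'Engine_Cylinders']
```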

However, looking at the correlation coefficient is insufficient to conclude whether a variable such as the engine horsepower can affect the MSRP for luxury cars. Hence, we will run a multiple regression analysis to confirm this relationship.

3.2 Multiple Linear Regression to find factors that affect MSRP the most

In [42]:
#OLS, Multiple Linear Regression
model = smf.ols('MSRP ~ Engine_HP + Year + C(Transmission_Type) + C(Driven_Wheels) + C(Vehicle_Size) + C(Vehicle_Style) \
                + C(Engine_Fuel_Type) + Popularity + City_MPG + Highway_MPG + \
                Engine_Cylinders + Number_of_Doors', data=luxury_data)
result = model.fit()
print(result.summary())

pvalues = result.pvalues

print('\n'*2)

coeff_var = []
pvalues = result.pvalues
coeff = result.params
coeff.sort_values(inplace=True)

for i in coeff.index:
    if pvalues[i] < 0.05:
        coeff_var.append(i)
print('Top variables ranked by coefficients and Low P-Values:\n')
print(result.params[coeff_var])
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MSRP   R-squared:                       0.658
Model:                            OLS   Adj. R-squared:                  0.645
Method:                 Least Squares   F-statistic:                     52.50
Date:                Thu, 12 Nov 2020   Prob (F-statistic):          1.35e-107
Time:                        11:09:10   Log-Likelihood:                -7009.7
No. Observations:                 539   AIC:                         1.406e+04
Df Residuals:                     519   BIC:                         1.415e+04
Df Model:                          19                                         
Covariance Type:            nonrobust                                         
====================================================================================================================================
                                                                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------------------------
Intercept                                                         1.855e+07   4.27e+06      4.345      0.000    1.02e+07    2.69e+07
C(Transmission_Type)[T.AUTOMATIC]                                 3506.3534   1.79e+04      0.196      0.844   -3.16e+04    3.86e+04
C(Transmission_Type)[T.MANUAL]                                    5087.5425   1.45e+04      0.352      0.725   -2.33e+04    3.35e+04
C(Driven_Wheels)[T.rear wheel drive]                             -7016.0267   1.42e+04     -0.495      0.621   -3.49e+04    2.09e+04
C(Vehicle_Size)[T.Large]                                          6.528e+04   2.29e+04      2.856      0.004    2.04e+04     1.1e+05
C(Vehicle_Size)[T.Midsize]                                       -2.173e+04   1.85e+04     -1.172      0.242   -5.81e+04    1.47e+04
C(Vehicle_Style)[T.4dr SUV]                                      -8.102e+05   1.07e+05     -7.596      0.000   -1.02e+06   -6.01e+05
C(Vehicle_Style)[T.Convertible]                                   8.804e+04   6.33e+04      1.391      0.165   -3.63e+04    2.12e+05
C(Vehicle_Style)[T.Coupe]                                         6.937e+04   6.35e+04      1.092      0.275   -5.54e+04    1.94e+05
C(Vehicle_Style)[T.Sedan]                                        -8.679e+05   1.01e+05     -8.552      0.000   -1.07e+06   -6.69e+05
C(Engine_Fuel_Type)[T.flex-fuel (premium unleaded required/E85)] -2.822e+05   8.61e+04     -3.276      0.001   -4.51e+05   -1.13e+05
C(Engine_Fuel_Type)[T.premium unleaded (required)]               -1.437e+05   8.28e+04     -1.736      0.083   -3.06e+05    1.89e+04
C(Engine_Fuel_Type)[T.regular unleaded]                          -2.062e+05   9.78e+04     -2.110      0.035   -3.98e+05   -1.42e+04
Engine_HP                                                         1124.6691     91.477     12.295      0.000     944.958    1304.380
Year                                                             -9896.2541   2145.831     -4.612      0.000   -1.41e+04   -5680.673
Popularity                                                         -15.2481      7.394     -2.062      0.040     -29.775      -0.721
City_MPG                                                          6374.1971   6776.478      0.941      0.347   -6938.501    1.97e+04
Highway_MPG                                                       1166.2873   4772.810      0.244      0.807   -8210.114    1.05e+04
Engine_Cylinders                                                  1.194e+04   3839.980      3.110      0.002    4398.712    1.95e+04
Number_of_Doors                                                   4.532e+05   4.01e+04     11.299      0.000    3.74e+05    5.32e+05
==============================================================================
Omnibus:                      638.993   Durbin-Watson:                   1.154
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            63005.546
Skew:                           5.586   Prob(JB):                         0.00
Kurtosis:                      54.775   Cond. No.                     2.13e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.13e+06. This might indicate that there are
strong multicollinearity or other numerical problems.



Top variables ranked by coefficients and Low P-Values:

C(Vehicle_Style)[T.Sedan]                                          -8.679465e+05
C(Vehicle_Style)[T.4dr SUV]                                        -8.101977e+05
C(Engine_Fuel_Type)[T.flex-fuel (premium unleaded required/E85)]   -2.821706e+05
C(Engine_Fuel_Type)[T.regular unleaded]                            -2.062496e+05
Year                                                               -9.896254e+03
Popularity                                                         -1.524813e+01
Engine_HP                                                           1.124669e+03
Engine_Cylinders                                                    1.194253e+04
C(Vehicle_Size)[T.Large]                                            6.528444e+04
Number_of_Doors                                                     4.532173e+05
Intercept                                                           1.854672e+07
dtype: float64

3.3 Variable selection by deleting variables that fail our hypothesis test

From our original regression model, we remove Driven_Wheels, Transmission_Type, City_MPG, and Highway_MPG: their p-values exceed the 5% level of significance, so we cannot reject the null hypothesis that their coefficients are zero.
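This test can be read directly off a fitted result: statsmodels exposes the per-coefficient p-values as a Series, so the failing variables can be listed programmatically. A minimal sketch on synthetic data (here `result` is a stand-in for the fitted model above, not the notebook's own object):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends on x1 only, so x2 should fail the t-test
rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=200), 'x2': rng.normal(size=200)})
df['y'] = 3 * df['x1'] + rng.normal(size=200)

result = smf.ols('y ~ x1 + x2', data=df).fit()

# Variables that fail the hypothesis test at the 5% level of significance
failing = result.pvalues[result.pvalues > 0.05].index.tolist()
print(failing)
```

The same one-liner applied to `result_1` would reproduce the manual reading of the summary table above.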

In [43]:
model = smf.ols('MSRP ~ Engine_HP + Year + C(Vehicle_Size) + C(Vehicle_Style) + C(Engine_Fuel_Type) \
                + Popularity + Engine_Cylinders + Number_of_Doors', data=luxury_data)
                
result_2 = model.fit()
print(result_2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MSRP   R-squared:                       0.654
Model:                            OLS   Adj. R-squared:                  0.644
Method:                 Least Squares   F-statistic:                     70.61
Date:                Thu, 12 Nov 2020   Prob (F-statistic):          9.23e-111
Time:                        11:09:11   Log-Likelihood:                -7012.9
No. Observations:                 539   AIC:                         1.406e+04
Df Residuals:                     524   BIC:                         1.412e+04
Df Model:                          14                                         
Covariance Type:            nonrobust                                         
====================================================================================================================================
                                                                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------------------------
Intercept                                                         1.122e+07   2.79e+06      4.023      0.000    5.74e+06    1.67e+07
C(Vehicle_Size)[T.Large]                                          5.416e+04   1.89e+04      2.866      0.004     1.7e+04    9.13e+04
C(Vehicle_Size)[T.Midsize]                                       -2.399e+04   1.49e+04     -1.606      0.109   -5.33e+04    5352.178
C(Vehicle_Style)[T.4dr SUV]                                      -8.146e+05   1.06e+05     -7.684      0.000   -1.02e+06   -6.06e+05
C(Vehicle_Style)[T.Convertible]                                   9.353e+04    6.3e+04      1.484      0.138   -3.03e+04    2.17e+05
C(Vehicle_Style)[T.Coupe]                                         7.799e+04   6.32e+04      1.234      0.218   -4.61e+04    2.02e+05
C(Vehicle_Style)[T.Sedan]                                        -8.631e+05   1.01e+05     -8.529      0.000   -1.06e+06   -6.64e+05
C(Engine_Fuel_Type)[T.flex-fuel (premium unleaded required/E85)] -2.921e+05   8.55e+04     -3.415      0.001    -4.6e+05   -1.24e+05
C(Engine_Fuel_Type)[T.premium unleaded (required)]               -1.653e+05   8.19e+04     -2.018      0.044   -3.26e+05   -4419.929
C(Engine_Fuel_Type)[T.regular unleaded]                          -1.773e+05   9.53e+04     -1.860      0.063   -3.65e+05    9958.730
Engine_HP                                                         1054.6041     70.864     14.882      0.000     915.392    1193.816
Year                                                             -6171.4300   1379.902     -4.472      0.000   -8882.250   -3460.610
Popularity                                                         -11.8210      6.469     -1.827      0.068     -24.530       0.888
Engine_Cylinders                                                  8904.0174   3450.634      2.580      0.010    2125.242    1.57e+04
Number_of_Doors                                                   4.577e+05      4e+04     11.432      0.000    3.79e+05    5.36e+05
==============================================================================
Omnibus:                      657.325   Durbin-Watson:                   1.123
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            70658.011
Skew:                           5.846   Prob(JB):                         0.00
Kurtosis:                      57.859   Cond. No.                     1.39e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.39e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Conclusion: Looking at the p-values, we drop Vehicle Size 'Midsize', Vehicle Style 'Convertible' and 'Coupe', Engine Fuel Type 'regular unleaded', and Popularity, since each has p-value > 0.05 and therefore fails our hypothesis test at the 5% level of significance.
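The pruning we perform by hand across sections 3.3 to 3.5 can also be automated as a backward-elimination loop: refit, drop the single worst predictor whose p-value exceeds the threshold, and repeat. A sketch on synthetic data (the formula rebuilding is simplified to plain numeric terms and would need adapting for `C()` categorical terms):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(data, response, predictors, alpha=0.05):
    """Iteratively drop the predictor with the largest p-value above alpha."""
    predictors = list(predictors)
    while predictors:
        formula = response + ' ~ ' + ' + '.join(predictors)
        result = smf.ols(formula, data=data).fit()
        pvals = result.pvalues.drop('Intercept')
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            break                      # all remaining predictors are significant
        predictors.remove(worst)       # drop the least significant one and refit
    return predictors

# Synthetic demo: y depends on a and b, while c is pure noise
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=['a', 'b', 'c'])
df['y'] = 2 * df['a'] - 1.5 * df['b'] + rng.normal(size=300)

print(backward_eliminate(df, 'y', ['a', 'b', 'c']))
```

Dropping one variable at a time (rather than all insignificant ones at once) matters because p-values change whenever the set of regressors changes.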

3.4 Adding dummy variables to fine-tune variable selection for categorical data


In [44]:
df = luxury_data
In [45]:
# Modify data frame by adding dummy variables
df2 = pd.concat([df, pd.get_dummies(df['Vehicle_Size'].astype('category'), prefix = 'd')], axis = 1)
df2 = pd.concat([df2, pd.get_dummies(df2['Vehicle_Style'].str.replace(' ','_').astype('category'), prefix = 'd')], axis = 1)
df2 = pd.concat([df2, pd.get_dummies(df2['Engine_Fuel_Type'].str.replace(' ','_').
                                     str.replace('-','_').str.replace('(', '').str.replace(')','').
                                     str.replace('/', '_').astype('category'), prefix = 'd')], axis = 1)

# Regression model
model = smf.ols('MSRP ~ Engine_HP + Year + d_Large + d_4dr_SUV + d_Sedan + d_flex_fuel_premium_unleaded_required_E85 \
                + d_premium_unleaded_required + Engine_Cylinders + Number_of_Doors', data=df2)
                
result_3 = model.fit()
print(result_3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MSRP   R-squared:                       0.645
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     106.6
Date:                Thu, 12 Nov 2020   Prob (F-statistic):          8.04e-113
Time:                        11:09:13   Log-Likelihood:                -7019.8
No. Observations:                 539   AIC:                         1.406e+04
Df Residuals:                     529   BIC:                         1.410e+04
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
=============================================================================================================
                                                coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------
Intercept                                  8.945e+06   2.42e+06      3.698      0.000    4.19e+06    1.37e+07
Engine_HP                                   999.3568     68.028     14.690      0.000     865.719    1132.994
Year                                      -5096.6235   1211.624     -4.206      0.000   -7476.809   -2716.438
d_Large                                     6.81e+04   1.73e+04      3.937      0.000    3.41e+04    1.02e+05
d_4dr_SUV                                 -9.286e+05   8.37e+04    -11.095      0.000   -1.09e+06   -7.64e+05
d_Sedan                                   -9.675e+05   7.93e+04    -12.199      0.000   -1.12e+06   -8.12e+05
d_flex_fuel_premium_unleaded_required_E85 -1.266e+05   4.59e+04     -2.759      0.006   -2.17e+05   -3.64e+04
d_premium_unleaded_required                3770.8837   3.71e+04      0.102      0.919   -6.91e+04    7.66e+04
Engine_Cylinders                            1.07e+04   3031.860      3.528      0.000    4740.900    1.67e+04
Number_of_Doors                             4.67e+05   4.01e+04     11.650      0.000    3.88e+05    5.46e+05
==============================================================================
Omnibus:                      664.787   Durbin-Watson:                   1.141
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            74525.545
Skew:                           5.950   Prob(JB):                         0.00
Kurtosis:                      59.363   Cond. No.                     1.05e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.05e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

Conclusion: In this third run, premium unleaded (required) has a p-value of 0.919, so we can eliminate it as well.

3.5 Deriving the final model


As seen below, removing premium unleaded (required) leaves R-squared unchanged at 0.645, while every remaining p-value is now significant at the 5% level.

In [46]:
# Regression model
model = smf.ols('MSRP ~ Engine_HP + Year + d_Large + d_4dr_SUV + d_Sedan \
                + d_flex_fuel_premium_unleaded_required_E85 + Engine_Cylinders + Number_of_Doors', data=df2)
                
result_4 = model.fit()
print(result_4.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   MSRP   R-squared:                       0.645
Model:                            OLS   Adj. R-squared:                  0.639
Method:                 Least Squares   F-statistic:                     120.2
Date:                Thu, 12 Nov 2020   Prob (F-statistic):          7.14e-114
Time:                        11:09:15   Log-Likelihood:                -7019.8
No. Observations:                 539   AIC:                         1.406e+04
Df Residuals:                     530   BIC:                         1.410e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
=============================================================================================================
                                                coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------------
Intercept                                  8.815e+06   2.05e+06      4.304      0.000    4.79e+06    1.28e+07
Engine_HP                                   999.6145     67.917     14.718      0.000     866.195    1133.034
Year                                      -5030.1398   1019.264     -4.935      0.000   -7032.434   -3027.846
d_Large                                    6.819e+04   1.73e+04      3.952      0.000    3.43e+04    1.02e+05
d_4dr_SUV                                 -9.287e+05   8.36e+04    -11.108      0.000   -1.09e+06   -7.64e+05
d_Sedan                                   -9.674e+05   7.92e+04    -12.210      0.000   -1.12e+06   -8.12e+05
d_flex_fuel_premium_unleaded_required_E85 -1.306e+05    2.4e+04     -5.442      0.000   -1.78e+05   -8.34e+04
Engine_Cylinders                           1.077e+04   2953.681      3.645      0.000    4962.861    1.66e+04
Number_of_Doors                            4.669e+05      4e+04     11.662      0.000    3.88e+05    5.46e+05
==============================================================================
Omnibus:                      664.486   Durbin-Watson:                   1.140
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            74428.597
Skew:                           5.945   Prob(JB):                         0.00
Kurtosis:                      59.327   Cond. No.                     8.90e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.9e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

3.6 Visualisation of Final Model Coefficients

In [47]:
#Preparing data
coefficients = pd.DataFrame(result_4.params[1:])   # Estimates of model parameters, after removing 'intercept'
coefficients[0] = StandardScaler().fit_transform(coefficients)
coefficients['colors'] = ['red' if x < 0 else 'green' for x in coefficients[0]]
coefficients = coefficients.sort_values(by= [0],inplace=False)

#Plotting
sns.set(style='darkgrid', font_scale=1.3, font="Arial", 
        rc={'axes.facecolor':'#d8d8d8'})
plt.figure(figsize=(14,10), dpi= 80)
plt.hlines(y=coefficients.index, xmin=0, xmax=coefficients[0], color=coefficients.colors, alpha=0.7, linewidth=5)

#Labelling and aesthetics
plt.yticks(coefficients.index, fontsize=15)
plt.ylabel('Attributes', fontsize=16)
plt.xlabel('Regression Coefficient', fontsize=16)
plt.title('Standardized Coefficients of Regression Model Attributes', fontdict={'size':22}, fontweight = 'bold')
plt.grid(linestyle='--', alpha=0.7)
plt.show()

From this graph and the regression summary, four of our 8 variables reduce predicted MSRP (Year, 4-door SUV, Sedan, and flex-fuel), while the other four (Engine Horsepower, Large vehicle size, Engine Cylinders, and Number of Doors) increase it.

3.7 Train-Test Split to confirm our model

The train-test split procedure estimates how well a model predicts data that was not used to train it.

In [48]:
train, test = train_test_split(df2, 
                         train_size=0.75,
                         test_size=0.25)              # Split the data set by a 75/25 ratio

# Specify the model based on the variable selection
model ='MSRP ~ Engine_HP + Year + d_Large + d_4dr_SUV + d_Sedan \
                + d_flex_fuel_premium_unleaded_required_E85 + Engine_Cylinders + Number_of_Doors'

# Fit the model to identify the coefficients
result = smf.ols(model,data=train).fit()

# Use the fitted model to make predictions
amt_hat = result.predict(test)              

print('Summary of MSRP (Luxury Cars):')
print('Mean:   ',df2['MSRP'].mean())
print('Max:    ',df2['MSRP'].max())
print('Min:    ',df2['MSRP'].min(),'\n')

rmse = ((test['MSRP'] - amt_hat)**2).mean() ** 0.5
print('Root Mean Square Error = {0:0.4f}'.format(rmse))
print('Normalised Root Mean Square Error = ', rmse/(df2['MSRP'].max()-df2['MSRP'].min()))
Summary of MSRP (Luxury Cars):
Mean:    217550.56400742114
Max:     2065902
Min:     2667 

Root Mean Square Error = 119900.2370
Normalised Root Mean Square Error =  0.058112738989873604

A low Normalised Root Mean Square Error of < 0.1 confirms that our model is quite accurate in predicting price.
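The error calculation above can be packaged into a small helper so the same check is easy to repeat for any fitted model (`nrmse` is our own name for it, not a library function):

```python
import numpy as np

def nrmse(actual, predicted):
    """Root mean square error, normalised by the range of the actual values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((actual - predicted) ** 2))
    return rmse / (actual.max() - actual.min())

# A perfect prediction gives 0; a constant offset gives offset / range
print(nrmse([0, 50, 100], [0, 50, 100]))   # 0.0
print(nrmse([0, 50, 100], [10, 60, 110]))  # 0.1
```

Normalising by the range of MSRP is what makes the 0.058 above comparable across datasets with very different price scales.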

Insight: Our final model is 'MSRP ~ Engine_HP + Year + d_Large + d_4dr_SUV + d_Sedan + d_flex_fuel_premium_unleaded_required_E85 + Engine_Cylinders + Number_of_Doors'


Specifically, we have identified a total of 8 variables that are significant determinants of MSRP: 4 are continuous variables and the other 4 are categorical.

The variables are: Engine Horsepower, Year, Vehicle Size (Large), Vehicle Style (4-door SUV and Sedan), Engine Fuel Type (flex-fuel (premium unleaded required/E85)), Engine Cylinders, and Number of Doors.

For our 3 target brands, some categorical values, e.g. particular Vehicle Styles, may not be present at all. Where that is the case, we will instead draw links between those categorical variables and the continuous variables that show trends within our target brands.

Step 4: Forming a pricing guideline to upsell car models based on their attribute

We will now determine how the chosen attributes affect our chosen brands

4.1 Preparing data

In [49]:
#Data containing the 3 brands we have identified earlier
top = (data_cleaned['Make'] == 'Bentley') | (data_cleaned['Make'] == 'Porsche') | (data_cleaned['Make'] == 'Ferrari')
top_3 = data_cleaned.loc[top] 
top_3.head()
Out[49]:
MSRP Year Engine_Fuel_Type Engine_HP Engine_Cylinders Transmission_Type Driven_Wheels Number_of_Doors Vehicle_Size Vehicle_Style Highway_MPG City_MPG Popularity Make Model Market_Category
294 160829 2002 premium unleaded (required) 400.0 8.0 MANUAL rear wheel drive 2.0 Compact Convertible 15 10 2774 Ferrari 360 Exotic,High-Performance
295 140615 2002 premium unleaded (required) 400.0 8.0 MANUAL rear wheel drive 2.0 Compact Coupe 15 10 2774 Ferrari 360 Exotic,High-Performance
296 150694 2002 premium unleaded (required) 400.0 8.0 AUTOMATED_MANUAL rear wheel drive 2.0 Compact Coupe 15 10 2774 Ferrari 360 Exotic,High-Performance
297 170829 2002 premium unleaded (required) 400.0 8.0 AUTOMATED_MANUAL rear wheel drive 2.0 Compact Convertible 15 10 2774 Ferrari 360 Exotic,High-Performance
298 165986 2003 premium unleaded (required) 400.0 8.0 MANUAL rear wheel drive 2.0 Compact Convertible 15 10 2774 Ferrari 360 Exotic,High-Performance
In [50]:
#We will use the 3 variable names below to refer to the slice of dataset concerning cars of that brand
Bentley = top_3.loc[top_3['Make']=='Bentley',:]
Porsche = top_3.loc[top_3['Make']=='Porsche',:]
Ferrari = top_3.loc[top_3['Make']=='Ferrari',:]

print('Number of Cars for chosen Brands\n')
print('Bentley:',Bentley.shape[0])
print('Porsche:',Porsche.shape[0])
print('Ferrari:',Ferrari.shape[0])
Number of Cars for chosen Brands

Bentley: 74
Porsche: 136
Ferrari: 68

4.2 Violinplot of Distribution of MSRP amongst chosen brands

In [51]:
#Preparing data
violin_plot = ['MSRP','Year','Make']
bentley = Bentley.loc[:,violin_plot]
porsche = Porsche.loc[:,violin_plot]
ferrari = Ferrari.loc[:,violin_plot]

#Plotting
MSRP_YEAR_MAKE = pd.concat([bentley,porsche,ferrari])
MSRP_YEAR_MAKE = MSRP_YEAR_MAKE[MSRP_YEAR_MAKE['Year'].isin([x for x in range(2012,2018)])] #MSRP vs Make (2012 to 2017)
print('Range of years:  [{0:0.0f}, {1:0.0f}]'.format(MSRP_YEAR_MAKE['Year'].min(),MSRP_YEAR_MAKE['Year'].max()))
sns.set(style='darkgrid', font_scale=1.3, font="Arial", 
        rc={'axes.facecolor':'#d8d8d8'})
ax = sns.violinplot(x="Make", y="MSRP", data=MSRP_YEAR_MAKE, palette = 'pastel')

#Labelling and aesthetics
ax.set_xlabel('Brand',fontsize=14)
ax.set_ylabel('MSRP',fontsize=14)
plt.title('MSRP Distribution for chosen Brands', fontsize=16, fontweight= 'bold')
plt.show()
Range of years:  [2012, 2017]

4.3 Boxplot & Scatter + Regression Line for MSRP against Engine_HP for chosen brands

Boxplot showing Horsepower distribution for chosen brands

In [53]:
#Preparing data
df = top_3 #to simplify reference to the 3-brand dataset

#Plotting
plt.figure(figsize = (10,4))
sns.set(style='darkgrid', font_scale=1.3, font="calibri", 
        rc={'axes.facecolor':'#d8d8d8'})
sns.boxplot(x="Engine_HP", y="Make", data=df)

#Labelling and aesthetics
plt.title("Distribution of Horsepower for each Make",fontsize=18, fontweight= 'bold')
plt.show()

Scatter plot + regression line showing MSRP against Engine_HP for chosen brands

In [55]:
#Preparing data
cn = ['MSRP','Engine_HP']
ferrari = Ferrari.loc[:,cn]
porsche = Porsche.loc[:,cn]
bentley = Bentley.loc[:,cn]
df2 = pd.concat([ferrari,porsche,bentley],axis = 0)

min_HP = df2.loc[:,'Engine_HP'].min()
max_HP = df2.loc[:,'Engine_HP'].max()
x = np.linspace(min_HP, max_HP, 100, 0)

#Quadratic regression coefficients (degree-2 np.polyfit)
Da,Db,Dc = np.polyfit(df2['Engine_HP'],      df2['MSRP'],      2)         #Overall
Ia,Ib,Ic = np.polyfit(ferrari['Engine_HP'],     ferrari['MSRP'],     2)   #Ferrari
Wa,Wb,Wc = np.polyfit(porsche['Engine_HP'],     porsche['MSRP'],     2)   #Porsche
Ga,Gb,Gc = np.polyfit(bentley['Engine_HP'],     bentley['MSRP'],     2)   #Bentley

#Plotting
sns.set(style='darkgrid', font_scale=1.3, font="calibri", 
        rc={'axes.facecolor':'#d8d8d8'})
plt.figure(figsize = (13,10))

plt.scatter(ferrari['Engine_HP'], ferrari['MSRP'],marker='.',alpha=0.5,label='Ferrari')
plt.plot(x,Ia*x**2+Ib*x+Ic,linewidth = 4,label='Ferrari')

plt.scatter(porsche['Engine_HP'], porsche['MSRP'],marker='.',alpha=0.5,label='Porsche')
plt.plot(x,Wa*x**2+Wb*x+Wc,linewidth = 4,label='Porsche')

plt.scatter(bentley['Engine_HP'], bentley['MSRP'],marker='.',alpha=0.5,label='Bentley')
plt.plot(x,Ga*x**2+Gb*x+Gc,linewidth = 4,label='Bentley')
plt.plot(x,Da*x**2+Db*x+Dc,linewidth = 4,label='Overall')

#Labelling and aesthetics
plt.title('MSRP vs Engine Horsepower', fontsize=20, fontweight= 'bold')
plt.legend(fontsize=12)
plt.xlabel('Engine Horsepower', fontsize=15)
plt.ylabel('MSRP', fontsize=15)
plt.show()

#Statistics
print('Dataset contains target brands only\n')
lin_model = smf.ols('MSRP ~ Engine_HP',df2)
lin_result = lin_model.fit()
print('Linear Regression:')
print('Adj. R-squared: {0:0.10f}\n'.format(lin_result.rsquared_adj))

quad_model = smf.ols('MSRP ~ Engine_HP + np.square(Engine_HP)',df2)
quad_result = quad_model.fit()
print('Quadratic Regression:')
print('Adj. R-squared: {0:0.10f}\n'.format(quad_result.rsquared_adj))
if  quad_result.rsquared_adj>lin_result.rsquared_adj:
    print('Quadratic Regression is more suitable.')
Dataset contains target brands only

Linear Regression:
Adj. R-squared: 0.6066455833

Quadratic Regression:
Adj. R-squared: 0.6082746323

Quadratic Regression is more suitable.
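The linear-vs-quadratic comparison generalises to any polynomial degree: adjusted R-squared can be computed by hand from the residuals of np.polyfit. A sketch on synthetic data (not the notebook's df2), with our own helper name `adj_r_squared`:

```python
import numpy as np

def adj_r_squared(x, y, degree):
    """Adjusted R^2 of a degree-d polynomial fit of y on x."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    n = len(y)
    # Penalise extra parameters, as in the statsmodels summaries above
    return 1 - (1 - r2) * (n - 1) / (n - degree - 1)

# Synthetic demo: data generated from a quadratic, so degree 2 should win over degree 1
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 0.5 * x**2 + x + rng.normal(scale=2.0, size=200)

for d in (1, 2, 3):
    print(d, adj_r_squared(x, y, d))
```

Because adjusted R-squared penalises extra parameters, a higher-degree fit only wins when the added curvature genuinely improves the fit, which is the basis of the check above.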

4.4 Scatterplot and Regression Line for MSRP against Engine_Cylinders for chosen brands

In [57]:
#Preparing data
cn = ['MSRP','Engine_Cylinders']
ferrari = Ferrari.loc[:,cn]
porsche = Porsche.loc[:,cn]
bentley = Bentley.loc[:,cn]
df2 = pd.concat([ferrari,porsche,bentley],axis = 0)

min_Engine_Cylinders = df2.loc[:,'Engine_Cylinders'].min()
max_Engine_Cylinders = df2.loc[:,'Engine_Cylinders'].max()
x = np.linspace(min_Engine_Cylinders, max_Engine_Cylinders, 100, 0)

#Quadratic regression coefficients (degree-2 np.polyfit)
Da,Db,Dc = np.polyfit(df2['Engine_Cylinders'],         df2['MSRP'],         2)         #Overall
Ia,Ib,Ic = np.polyfit(ferrari['Engine_Cylinders'],     ferrari['MSRP'],     2)         #Ferrari
Wa,Wb,Wc = np.polyfit(porsche['Engine_Cylinders'],     porsche['MSRP'],     2)         #Porsche
Ga,Gb,Gc = np.polyfit(bentley['Engine_Cylinders'],     bentley['MSRP'],     2)         #Bentley

#Plotting
sns.set(style='darkgrid', font_scale=1.3, font="calibri", 
        rc={'axes.facecolor':'#d8d8d8'})
plt.figure(figsize = (13,10))

plt.scatter(ferrari['Engine_Cylinders'], ferrari['MSRP'],marker='.',alpha=0.5,label='Ferrari')
plt.plot(x,Ia*x**2+Ib*x+Ic,linewidth = 4,label='Ferrari')

plt.scatter(porsche['Engine_Cylinders'], porsche['MSRP'],marker='.',alpha=0.5,label='Porsche')
plt.plot(x,Wa*x**2+Wb*x+Wc,linewidth = 4,label='Porsche')

plt.scatter(bentley['Engine_Cylinders'], bentley['MSRP'],marker='.',alpha=0.5,label='Bentley')
plt.plot(x,Ga*x**2+Gb*x+Gc,linewidth = 4,label='Bentley')
plt.plot(x,Da*x**2+Db*x+Dc,linewidth = 4,label='Overall')

#Labelling and aesthetics
plt.title('MSRP vs Engine Cylinders', fontsize=20, fontweight= 'bold')
plt.legend(fontsize=12)
plt.xlabel('Engine_Cylinders', fontsize=15)
plt.ylabel('MSRP', fontsize=15)
warnings.filterwarnings("ignore")
plt.show()

#Statistics
print('Dataset contains target brands only\n')
lin_model = smf.ols('MSRP ~ Engine_Cylinders', df2)
lin_result = lin_model.fit()
print('Linear Regression:')
print('Adj. R-squared: {0:0.10f}\n'.format(lin_result.rsquared_adj))

quad_model = smf.ols('MSRP ~ Engine_Cylinders + np.square(Engine_Cylinders)',df2)
quad_result = quad_model.fit()
print('Quadratic Regression:')
print('Adj. R-squared: {0:0.10f}\n'.format(quad_result.rsquared_adj))
if quad_result.rsquared_adj>lin_result.rsquared_adj:
    print('Quadratic Regression is more suitable.')
Dataset contains target brands only

Linear Regression:
Adj. R-squared: 0.4865624601

Quadratic Regression:
Adj. R-squared: 0.5840244459

Quadratic Regression is more suitable.

4.5 Barplots of Distribution of Significant Categorical Variables for Target Car Brands

Grouped plots of various value counts vs brand

The categorical variables are:

  1. Vehicle Size(Large),
  2. Engine Fuel Type (Flex-fuel (premium unleaded required/E85))
  3. Vehicle Style (4-door SUV)
  4. Vehicle Style (Sedan)
In [58]:
#Preparing data
Bentley = top_3.loc[top_3['Make']=='Bentley',:]
Porsche = top_3.loc[top_3['Make']=='Porsche',:]
Ferrari = top_3.loc[top_3['Make']=='Ferrari',:]
cat1 = ('Vehicle_Size', 'Large')
cat2 = ('Engine_Fuel_Type','flex-fuel (premium unleaded required/E85)')
cat3 = ('Vehicle_Style', 'Sedan')
cat4 = ('Vehicle_Style', '4dr SUV')
categories = [cat1,cat2,cat3,cat4]
brands = [Bentley, Porsche, Ferrari]
brand_name = ['Bentley', 'Porsche', 'Ferrari']

ans = [[],[],[],[]]
r = 0
for i in categories:
    res = ans[r]
    for name in brands:    
        n = name[i[0]].value_counts().get(i[1]) #n gives the count according to cat for each brand
        if n is None:
            n = 0
        res.append(n)
    r = r+1

#Plotting
sns.set(style='darkgrid', font_scale=1.3, font="calibri", 
        rc={'axes.facecolor':'#d8d8d8'})
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(15, 10))
ax = ax.reshape(-1)
titles = ['Qty of Cars that are Large per brand', 
          'Qty of Cars with flex-fuel (premium unleaded required/E85) per brand',
         'Qty of Sedans per brand','Qty of 4 door SUVs per brand']
colors = ['#6497b1', '#35b899', '#b3c0e2','#ca5e27']
for i,j in enumerate(ax[:4]):
    j.bar(brand_name,ans[i],color = colors[i], edgecolor='black')
    j.title.set_text(titles[i])
    j.set_ylabel('Number of Cars')

#Labelling and aesthetics
plt.suptitle("Barplots showing Distribution of Significant Categorical Variables for each Brand", 
             fontsize=20, fontweight= 'bold')
plt.tight_layout()
plt.subplots_adjust(top = 0.9)
plt.show()

4.6 Grouped Barplot of MSRP against Years for Chosen Brands

In [60]:
print('Range of years:  [{0:0.0f}, {1:0.0f}]'.format(MSRP_YEAR_MAKE['Year'].min(),MSRP_YEAR_MAKE['Year'].max()))

#Plotting
plt.figure(figsize = (10,4))
ax = sns.barplot(x="Make", y="MSRP", hue="Year", data=MSRP_YEAR_MAKE)

#Labelling and aesthetics
ax.set_xlabel('Brand',fontsize=14)
ax.set_ylabel('MSRP',fontsize=14)
plt.title('Prices of cars for each brand by year',fontsize=18, fontweight= 'bold')
ax.legend(loc='upper center', frameon=False)
plt.show()
Range of years:  [2012, 2017]

As seen in the figure, from 2012 to 2017, except for Porsche, the later models carry a higher MSRP. Hence, Python Inc. should price newer Bentley and Ferrari models higher.

Appendix

Additional Graphs that may supplement our project

Analysis of Popularity and Quantity of car brands sold to determine which 3 brands to pick

  • Note: Popularity score is scaled down by a factor of 1000 to match the axis
In [62]:
#Preparing data
xaxislabels = pop_sorted['Brand']

#Plotting
plt.figure(figsize=(15,4))
plt.plot(xaxislabels, pop_sorted['Popularity'] / 1000, c= 'b', alpha = 0.5, marker ='x', label = 'Popularity') 
plt.plot(xaxislabels, pop_sorted['Quantity Sold (% of total)'], c= 'r', alpha = 0.5, marker = 'o', label = 'Quantity Sold (% of total)')

#Labelling and aesthetics
plt.xlabel('Car Brand', fontsize = 12)
plt.xticks(rotation = 90)
plt.ylabel('Popularity & Quantity Sold (% of total)' , fontsize = 12)
plt.title('Popularity and Quantity for luxury brands', fontweight= 'bold')
plt.legend()
plt.show()

Popularity of Brand v.s. Quantity Sold (% of total) for each luxury car brand

In [64]:
#We filter the cars down to luxury brands only, then select the top 3 brands to analyse
xaxislabels = pop_sorted['Brand']

#Plotting
width = 0.3
xa = np.arange(len(xaxislabels))
plt.figure(figsize=(15,4))
plt.bar( xa -0.5 * width, pop_sorted['Popularity'] / 1000, 
        width = width, color= 'b', alpha = 0.7, label = 'Popularity')
plt.bar( xa+0.5 * width, pop_sorted['Quantity Sold (% of total)'], 
        width = width, color= 'r', alpha = 0.7, label = 'Quantity Sold (% of total)')

#Labelling and aesthetics
plt.xlabel('Car Brand', fontsize = 12)
plt.xticks(xa , xaxislabels ,rotation = 90)
plt.ylabel('Popularity & Quantity Sold (% of total)' , fontsize = 12)
plt.title('Popularity of Brand v.s. Quantity Sold (% of total)' , fontsize = 15, fontweight= 'bold')
plt.legend()
plt.show()

Group bar plot of Number of doors against MSRP for chosen brands

In [66]:
#Preparing data
bar_plot = ['MSRP','Number_of_Doors','Make']
bentley = Bentley.loc[:,bar_plot]
porsche = Porsche.loc[:,bar_plot]
ferrari = Ferrari.loc[:,bar_plot]

#Plotting
plt.figure(figsize = (15,5))
door_data = pd.concat([bentley,porsche,ferrari])

#Labelling and aesthetics
ax = sns.barplot(x="Number_of_Doors", y="MSRP", hue="Make", data=door_data)
ax.set_xlabel('Number of Doors',fontsize=14)
ax.set_ylabel('MSRP',fontsize=14)
plt.title('Group bar plot of no. of doors vs MSRP', fontweight= 'bold')
plt.show()

The grouped bar plot shows that Python Inc. should sell Bentley cars with 2 doors, since their MSRP is higher. For Porsche, there is no significant difference. Lastly, Ferrari only offers cars with 2 doors.

Barplot of MSRP against Vehicle Size (Large) for Chosen Brands

In [67]:
Bentley = top_3.loc[top_3['Make']=='Bentley',:]
Porsche = top_3.loc[top_3['Make']=='Porsche',:]
Ferrari = top_3.loc[top_3['Make']=='Ferrari',:]

Bentley_VSL = Bentley.loc[(Bentley['Vehicle_Size']=='Large') & (Bentley['Engine_Cylinders']==8.0)]
Porsche_VSL = Porsche.loc[(Porsche['Vehicle_Size']=='Large') & (Porsche['Engine_Cylinders']==8.0)]
Ferrari_VSL = Ferrari.loc[(Ferrari['Vehicle_Size']=='Large') & (Ferrari['Engine_Cylinders']==8.0)]

print('Vehicle Size(Large) per brand:')
print('Bentley: ',Bentley_VSL['Vehicle_Size'].count())
print('Porsche: ', Porsche_VSL['Vehicle_Size'].count())
print('Ferrari: ', Ferrari_VSL['Vehicle_Size'].count())
Vehicle Size(Large) per brand:
Bentley:  28
Porsche:  10
Ferrari:  0
In [68]:
#Preparing data
bar_plot_VS = ['MSRP','Vehicle_Size','Make']
bentley = Bentley.loc[:,bar_plot_VS]
porsche = Porsche.loc[:,bar_plot_VS]
ferrari = Ferrari.loc[:,bar_plot_VS]

#Plotting
plt.figure(figsize = (15,5))
VS_data = pd.concat([bentley,porsche,ferrari])

#Labelling and aesthetics
ax = sns.barplot(x="Vehicle_Size", y="MSRP", hue="Make", data=VS_data)
ax.set_xlabel('Vehicle Size',fontsize=14)
ax.set_ylabel('MSRP',fontsize=14)
plt.title('Group bar plot of vehicle sizes vs MSRP', fontweight= 'bold')
plt.show()

Barplot of MSRP against Vehicle Style for Chosen Brands

To test: Vehicle Style (4-door SUV and Sedan)

In [69]:
#Preparing data
bar_plot_VS2 = ['MSRP','Vehicle_Style','Make']
bentley = Bentley.loc[:,bar_plot_VS2]
porsche = Porsche.loc[:,bar_plot_VS2]
ferrari = Ferrari.loc[:,bar_plot_VS2]
VS2_data = pd.concat([bentley,porsche,ferrari])

#Plotting
plt.figure(figsize = (15,5))
ax = sns.barplot(x="Vehicle_Style", y="MSRP", hue="Make", data=VS2_data)

#Labelling and aesthetics
ax.set_xlabel('Vehicle Style',fontsize=14)
ax.set_ylabel('MSRP',fontsize=14)
plt.title('Group bar plot of vehicle type vs MSRP', fontweight= 'bold')
plt.show()

Barplot of MSRP against Engine Fuel Type for Chosen Brands

To test: Engine Fuel Type (flex-fuel (premium unleaded required/E85))

In [71]:
#Preparing data
bar_plot_FT = ['MSRP','Engine_Fuel_Type','Make']
bentley = Bentley.loc[:,bar_plot_FT]
porsche = Porsche.loc[:,bar_plot_FT]
ferrari = Ferrari.loc[:,bar_plot_FT]
FT_data = pd.concat([bentley,porsche,ferrari])

#Plotting
plt.figure(figsize = (15,5))
ax = sns.barplot(x="Engine_Fuel_Type", y="MSRP", hue="Make", data=FT_data)

#Labelling and aesthetics
plt.title('Group bar plot of Fuel Type vs MSRP', fontsize=18, fontweight= 'bold')    
ax.set_xlabel('Engine Fuel Type',fontsize=14)
ax.set_ylabel('MSRP',fontsize=14)
plt.xticks(fontsize=10)
plt.show()

Brand vs Popularity of all cars

In [73]:
#Preparing data
sorted_pop = Brand_MSRP.sort_values(by= ['Popularity'],inplace=False)
brand = sorted_pop['Brand']
popularity = sorted_pop['Popularity']

#Plotting of graph
plt.figure(figsize=(16, 4))       
plt.bar(brand, popularity, width=0.7, color='#db7530', alpha=0.9)

#Labelling and aesthetics
plt.title('Brand Vs Popularity', fontsize=18, fontweight= 'bold')   
plt.xticks(rotation = 90, fontsize=8)
plt.xlabel('Brand')                     
plt.ylabel('Popularity')                     
plt.show()   

3-variable data visualisation of Average MSRP vs Quantity Sold for each brand

In [74]:
xaxislabels = sorted_msrp['Brand']

#Plotting
fig, ax1 = plt.subplots(figsize=(20,10)) # Using subplot function
ax2 = ax1.twinx() # Mirror the y-axis, making a secondary axis
ax1.bar(xaxislabels, sorted_msrp['Quantity Sold'], alpha = 0.8, color = 'b', label ='Quantity of Cars Sold')
ax2.plot(xaxislabels, sorted_msrp['Average MSRP'], marker = 'x', c = 'r', label ='Average MSRP of Brand' )

#Labelling and aesthetics
ax1.set_xlabel('Car Brand', fontsize = 12)
ax1.set_ylabel('Number of cars sold in sample' , fontsize = 12)
ax2.set_ylabel('Average MSRP of Car Brand', fontsize = 12, rotation = 270)
plt.title('Average MSRP and quantity sold for each brand', fontsize=18, fontweight= 'bold')  
plt.setp(ax1.xaxis.get_majorticklabels(), rotation = 90) # Setting the xtick labels to rotate 90 degrees
fig.legend(loc='upper center')
plt.show()
In [ ]: